Normalizing document metadata using directory services

ABSTRACT

The present invention provides methods, systems, and computer program products for normalizing document search terms through use of an alias database, as may be found in an alias relationship file, such as a directory service. A gatherer module receives as input (or crawls through) several documents in series or in parallel and can recognize data segments as related to one of the aliases in the alias relationship file. The gatherer then associates the document appropriately so that a search engine may find all documents associated with a search term, regardless of whether the term has undergone several name changes (various aliases) over the course of time. Accordingly, a user may then search for a person's name, and receive as a search result all documents listing the person's name, as well as documents listing, for example, only the person's email address.

BACKGROUND OF THE INVENTION

[0001] 1. The Field of the Invention

[0002] This invention relates to systems, methods, and computer program products for improving computerized search functions by synchronizing document metadata using directory services.

[0003] 2. Background and Relevant Art

[0004] Computerized environments have increased the efficiency by which people perform a wide variety of tasks. For example, computers and computer networks have vastly improved the speed and capabilities by which people communicate ideas to each other. Computerized systems also provide people with enhanced tools for fixing varyingly complex thoughts into an easily accessible medium, which provide far more options than typewriters, pens, pencils, and notepads. Thus, computerized systems greatly enhance information access and authoring power. In these regards, the advantages of computerized systems are well known.

[0005] With regard to information creation, one can author (or create) information as simply as by typing one or two basic text paragraphs in a document that the author may wish to send to another over electronic mail (E-Mail). In other cases, one or many authors may generate thousands of pages in a word processing document, where the word processing document may include several spoken languages, may contain several graphics and other multi-media content, and may comprise a wide variety of electronic formats. In any case, common electronic tools such as a word processor, a web page creator, a text form, and so on help authors affix huge amounts of information into a wide variety of accessible electronic media.

[0006] Computerized systems have also enhanced the speed and ability to locate and access this information created by others. Information is accessed and distributed using any one of a number of different techniques and applications, including electronic mail; distributed networks including the Internet and corporate intranets; database storage and access systems; and the like. However, the overwhelming amount of data and information that is accessible has given rise to problems. In particular, the ability to specifically locate a relevant piece (or pieces) of information, such as a document, from a large and distributed database of information, such as the Internet or a large corporate intranet, has proved to be increasingly difficult.

[0007] To address this problem, various types of search tools, sometimes referred to as “search engines,” have been developed. While any of a number of different techniques and search algorithms are used, in general users of a search engine typically enter one or more search terms, and search results corresponding to those terms are returned by the search engine. In a typical implementation, a person may visit an Internet or local webpage that employs a functional text-input box. The person enters one or more search terms into the text box and the search engine may return to the person one or more related documents, depending on the specificity and nature of the search request.

[0008] Search engine implementations vary in complexity and capability. A very simple search engine, for example, may only search for exact spellings of certain words within an opened document. Thus, if a user were to type the misspelled word “medixcal” into a search box, the simple search engine will not likely return any results, or point the user to any meaningful point in the document, unless that exact misspelling has been made within the document. A more complex search engine, however, may allow a user to search millions upon millions of documents based on a wide variety of criteria, even allowing the user to add detailed restrictions, all the while compensating for misspellings. For example, a complex search engine may allow a user to search millions of documents on a local or wide area network for the terms “Fyre Engine”+“fire pole”+“Dennis Finch”, with the restriction that all results must be in English, and that the resultant web page be created after the year 1998. In some cases, the search engine may even correct the spelling of “Fyre Engine” to “Fire Engine”, prior to executing the search.

[0009] FIG. 1 illustrates a prior art depiction of one example of an implementation of a search engine. In this example, the search engine algorithm first obtains several documents (or any analogous discrete unit of information) 105, 110 into a database such as index service 100. In one approach, a user may select and enter documents 105 and 110 manually into the index service 100 to be processed. Alternatively, the index service (or related search service) may have a function that automatically locates and obtains documents. This function is sometimes referred to as “crawling,” where the service continually “crawls” across multiple documents on the network by following document reference links within certain documents, and then processing each document as found. The index service 100 processes the documents by identifying key words or general text in the documents 105, 110, and then creating an inverted list 120 (more generally, an “index”).

[0010] An exemplary inverted list can be one or more electronic reference documents having a column containing a list of key words, a column containing one or more documents containing the key word, a column for the number of occurrences of that key word in the respective document, and a column with an address for each associated document. For the purposes of illustration, however, a simpler inverted list 120 is shown having a column of words (A, B, C, etc.), and a column indicating in which document those words can be found.
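
By way of illustration only, an inverted list of the kind just described might be sketched in Python as follows. The function name `build_inverted_list`, the sample texts, and the document addresses are hypothetical and are not taken from the figures; the sketch simply maps each key word to the addresses of the documents containing it, together with an occurrence count.

```python
from collections import Counter, defaultdict


def build_inverted_list(documents):
    """Map each key word to {document address: occurrence count}.

    `documents` maps a document address to its text. The layout is an
    illustrative assumption, not the structure shown in FIG. 1.
    """
    inverted = defaultdict(dict)
    for address, text in documents.items():
        # Count how often each lower-cased word occurs in this document.
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            inverted[word][address] = count
    return dict(inverted)


# Hypothetical documents standing in for documents 105 and 110.
docs = {
    "//server/doc1": "A B A",
    "//server/doc2": "B C",
}
inverted_list = build_inverted_list(docs)
```

A lookup such as `inverted_list["b"]` then yields every document address associated with the key word, which is the only relationship this simple form of index records.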

[0011] When a user enters one or more search terms (e.g., “Request for A” 132), a typical search engine 130 will employ an algorithm that first finds the one or more terms among the key words in the inverted list 120, and then weighs the resultant documents associated with any found words in the list. The search engine can then return one or more of the associated document references as results 136 to the user, depending on how the search algorithm is configured (e.g., documents having the most occurrences of the word), or depending on any restrictions the user places on the search (e.g., requiring an exact phrase match). Consequently, search engines can be quite useful for locating and accessing information contained within, for example, a distributed network environment.

[0012] Search engines such as the foregoing, however, tend to have certain limitations. Since the typical such search engine relies on a generated index to locate documents, the relevance of a given search result is highly dependent on the document content that is used to ultimately construct the index. For example, a document containing only the words “whale,” “fish,” “ocean,” and “ferry” would not be found by some search engines if a user entered the terms “orca,” “tuna,” “sea,” and “transport.” This is because, in general, search engines of the type described do not generate alternate word relationships when building an inverted list. While this type of search engine may provide automatic spell-checking of search terms, it does not automatically search word variants, synonyms, and homonyms, unless the user specifically enters them.

[0013] In addition, there are other problems that can complicate the amount and quality of data that a search engine can return to a user seeking information. For example, a large organization may have thousands upon thousands of internal documents on various topics posted on various servers on the local or wide area network. While each posted document may contain different metadata corresponding to metadata concepts (i.e., document identifiers) such as author, date created, size, title, etc., each document may be created with different programs that identify metadata properties differently, or describe the underlying data differently. For example, one document might include author metadata as: “Author=‘Heather F. Pettingill’” while another document's metadata might designate the author as: “By=‘H. Pettingill’” while yet another document might contain no author metadata and merely include the phrase “H. F. Pettingill” centered at the top of the first page. Thus, if an index were created that includes author metadata, a subsequent search may not locate some of these documents if a search were performed for the author “H. F. Pettingill.”

[0014] Even if the metadata format is standardized within the organization, the underlying data values that employees may use to classify documents within a general concept in the organization can often undergo several changes. For example, employees may refer to several documents under the classification of “Product Design” one year, and then “Manufacturing Policies” the next year when referring to the same general concept or classification. Similarly, a person's name or contact information may change several times over the course of their employment (e.g., due to name changes, email alias changes, a new email domain name, new preferences, a new office, or a new workgroup).

[0015] As such, this can degrade the effectiveness of searches, for example, for all documents authored by “Heather Pettingill,” or for all documents discussing product release policies over the last three to five years. For example, with specific reference to FIG. 1, if there were no direct correlation made between the values A and X on inverted list 120 such that A=X (e.g., “Heather Pettingill”=“Heather Martin” due to a marital name change), a normal search for A or X would only return “Doc” or “Doc2” (but not both) as a result 136. Typically, the only way the search engine might return both documents is if the user searched for both terms based on prior knowledge of the term correlation. Of course, this approach is limited by the fact that the user may not realize the correlation, though the user wishes to have all documents authored by the person in question.

[0016] Accordingly, there is a need for more robust systems, methods, and computer program products that relate the types of information available to a search engine so that more accurate search results can be obtained, without requiring a user to iteratively search several variations of the same terms and phrases. In addition, there is a need for robust systems, methods, and computer program products that allow users to search returned results for additional relationships, such as by metadata concepts, or classification data.

BRIEF SUMMARY OF THE INVENTION

[0017] The present invention solves one or more of the foregoing problems in the prior art by introducing systems, methods, and computer program products for normalizing data, such as metadata, for use in a searching algorithm. In one embodiment, a filtering or indexing service retrieves alias information from one or more accessible directory services. The inventive method retrieves this information to normalize such things as contact information, classifications, metadata references, and so on. For example, if the accessed directory service is a personal directory service, the directory service might include a database of all versions of a name by which a person has been (or currently is) recognized. Each version of the person's name can constitute an alias (or, alternate identity).

[0018] The directory service may also include various other forms of the person's contact information such as email aliases, prior and current workgroups, office locations and the like. In the case of personal contact information, each of the person's name aliases would then constitute a normalize-able identifier for the same person. The indexing or filtering service may also refer to other forms of information in the directory service as a type of class such that every alias of a person's name might constitute an “Authorship” class, whereas other types of alternate information might constitute a “Workgroup” class, etc. Each class may also be identifiable by one or more aliases (or, alternate identifiers).

[0019] Once such information has been normalized (or, “aliased”) from the directory service, a gathering service receives one or more documents as inputs, or crawls through various document links on a network, and identifies information segments in the documents as relating to the normalized data. The gathering service then stores these relationships in a way that is accessible to a search engine. Thus, when a user enters a search term into a search engine input box, the search engine consults the alias and/or classification databases as appropriate, and returns a more robust result.

[0020] Thus, a user can enter a person's name in one form and get a result from the search engine of all the documents related to that person's name, in whatever form the person has used, without requiring user knowledge of each of the forms. In addition, a user can make an initial search (or search returned results) to find additional property data from the search results that may relate to one or more aliased classes. For example, the user may enter a search of all documents created for the new topic name “Product Design”, and the search engine can return all the documents related to “Product Design”, without missing documents with a different topic name (e.g., “Manufacturing Specs”).

[0021] Additional features and advantages of the invention are set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

[0023] FIG. 1 illustrates a prior art depiction of an indexing and searching service.

[0024] FIG. 2 illustrates one embodiment of representing document data objects and identifiers in accordance with the present invention.

[0025] FIG. 3 illustrates one embodiment of an overview block diagram of inputs and outputs of a gatherer data structure.

[0026] FIG. 4 illustrates a flow chart that may be used to implement a gathering process in accordance with the present invention.

[0027] FIG. 5 illustrates a flow chart of acts and functional steps for practicing one embodiment of the present invention.

[0028] FIG. 6 illustrates a suitable computing environment that may be used to practice the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0029] The present invention extends to both methods and systems for normalizing document metadata using directory services. Disclosed embodiments may comprise a special purpose or general-purpose computer including various computer hardware implementations, as discussed in greater detail below.

[0030] Embodiments within the scope of the present invention also include computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

[0031] When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of computer-readable media. Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.

[0032] Reference is next made to FIG. 2, which illustrates an exemplary representation of document data objects and identifiers that might be used in connection with an embodiment of the present invention. Illustrated is an electronic document 200, which may be of a wide variety of formats. In the example, document 200 can be broken down into at least two component parts: a content portion 205 and a metadata portion 210. In this example, the content portion 205 is the data entered by a user creating the document, e.g., the text of a word processing document. The contents 205 can be broken down into sections that represent size portions of the document, for example by page, or by words, phrases, and letters in the case of a text document. Here, the document 200 has multiple “content segments” (also referred to as “chunks”) such as a name “Aaron J” 202, “Surgery” 206, and “Pediatric” 208. Each of these content segments can be representative of a particular property of the document.

[0033] The other component part of the document, referred to here as metadata 210, represents specific properties or characteristics associated with the document. Metadata may be entered by a user when creating a document, or may be entered through default settings in an application program. For example, one might discover metadata by opening a “properties” portion of an opened document, or by some similar method. Thus, metadata 210 are meant to give information that can help characterize the contents in some way. By way of example, the metadata 210 in FIG. 2 include metadata concepts (or “segments”), such as an “Author” segment 210, a file “Size” segment 214, and a “Keywords” segment 216. The metadata concepts are like classifications in the sense that they can be more generalized concepts of the more specific, underlying data content 205. Thus, a class is appropriately defined as a set of documents, and a label (e.g., all documents “Authored” by person A).

[0034] For example, the “Author” segment 210 indicates that “AEJ” and “NDS” are authors of the document 200. It could be that both “AEJ” and “NDS” manually entered this information when initially creating the document, or that an application may have merely tracked this information as both “AEJ” and “NDS” entered information into the document from different locations. The “Size” segment 214 indicates that the size of the document 200 is 52 Kb. The “Size” segment 214 could be a value that is automatically updated when the document is saved, for example. The “Keywords” segment 216 includes keyword data that represents, for example, the general concepts addressed by the content of the document. Here, keywords segment 216 shows that the document 200 content is related to the terms “Science, Med, Osteo, Orthopedic, Surgery, and Pediatric.” Again, these keywords could be manually entered, or could be automatically extracted from the content of the document itself.

[0035] FIG. 2 also shows an exemplary directory service 230, as can be used in connection with the example implementation. In the example, the directory service may contain one or more entries (such as is denoted here at 232 and 234), that each contain one or more data fields. For example, the entry 232 is for “Aaron Jones.” That entry includes data fields for additional name aliases “AAJ,” “AEJ,” as well as different email addresses at different Internet Service Providers (ISPs). In addition, there is a secondary identifier “ID:ORTHOPEDICS.” Similarly, there is an entry 234 for Nathan Smith, which includes data fields for name aliases “NDS” and “NAS,” as well as different email addresses at different ISPs. This entry also includes a data field for a secondary identifier, shown here as “ID:PEDIATRIC SURGERY.” It will be appreciated that these types of data fields are merely examples, and that any number and/or type of data field may be included in a particular entry. Also, it will be appreciated that a directory service may be as complex as a centralized, relational database in an organization, or may be as simple as a single text document relating one or more terms to one or more other corresponding alternative terms.
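
By way of illustration only, the simplest form of directory service described above (a structure relating terms to alternative terms) might be sketched in Python as follows. The field names, the `build_alias_map` helper, and the case-folding of aliases are assumptions made for the sketch; the entries mirror contact entries 232 and 234, and the email alias is borrowed from the example discussed in connection with FIG. 3.

```python
# Hypothetical in-memory directory service modeled on entries 232 and 234.
DIRECTORY_SERVICE = [
    {
        "name": "Aaron Jones",
        "aliases": ["AAJ", "AEJ", "AJONES@ISP.NET"],
        "secondary_id": "ORTHOPEDICS",
    },
    {
        "name": "Nathan Smith",
        "aliases": ["NDS", "NAS"],
        "secondary_id": "PEDIATRIC SURGERY",
    },
]


def build_alias_map(entries):
    """Return {case-folded alias: canonical name} for every entry."""
    alias_map = {}
    for entry in entries:
        canonical = entry["name"]
        # The canonical name is treated as an alias of itself.
        for alias in [canonical, *entry["aliases"]]:
            alias_map[alias.casefold()] = canonical
    return alias_map


ALIAS_MAP = build_alias_map(DIRECTORY_SERVICE)
```

Flattening the entries into a single lookup table is one possible design; a relational directory would more likely be queried in place.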

[0036] As is further shown in FIG. 2, each of the data field indicia (and other fields not shown) contained in the respective entries 232, 234 can be parsed by a gatherer function, denoted at 260, to provide an individualized association database 240 (e.g., an “alias” database). In the example shown, the association database 240 contains information from the parsed directory service 230. More particularly, in this example each contact entry 232, 234 is associated with one or more classes, denoted here at 246 and 256. As will be explained further, classes may be based on a general metadata concept, or may comprise a wide variety of other types of designations. In the example shown, one type of class in the database 240 could be authorship, and each contact entry 232, 234 is associated with one or more documents 200. The association database 240 can be configured to relate a wide variety of data from the directory service, and may be combined with a searchable index (e.g., the inverted index 120 in FIG. 1) to provide an enriched inverted index, as will be discussed in more detail.

[0037] FIG. 3 illustrates one embodiment of the methodology by which the gatherer function, here designated at module 300, generates a searchable index 360 based on a variety of inputs. A gatherer 300 may be a functional executable program, or may be a plug-in module to a more complicated overall executable program such as a search engine. In either case, a gatherer 300 references one or more directory services (denoted in the example as 340 and 350) to ascertain a collection of aliases for one or more terms, as described in FIG. 2. The gatherer 300 may reference the directory services as the gatherer 300 processes various documents on the network, or the gatherer may receive a list of values (e.g., aliases) from the directory service in the form of a file that the gatherer recognizes as a definition set of aliases for different terms. The provided aliases allow the gatherer to have a “normalized” frame of reference for associating a network document with an entity or class. In other words, the gatherer 300 can now recognize that several word variants mean the same thing.

[0038] To process documents, the gatherer 300 may first receive a document 305 as an input in some cases, such as a user entering certain document metadata before the user saves the document 305 on a network database. Or, the gatherer 300 may “crawl” throughout documents on a local or wide area network. As noted, crawling refers to a program module following links between various documents on a network in order to find and process additional documents. A gatherer 300 may crawl a network on its own initiative based on preset parameters, or a user may “seed” the gatherer 300 with a first network-based Uniform Resource Identifier (URI) 302. Once the gatherer 300 receives an input seed, the gatherer 300 may then follow any links or references contained within the document located at the first URI 302, following a trail through a second URI 315, a third URI 320, and a fourth URI 325 as it processes the respective documents.
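
By way of illustration only, seeded crawling of the kind described above might be sketched in Python as follows. The `crawl` function, the breadth-first visiting order, and the in-memory `LINKS` table standing in for a network are all assumptions made for the sketch; the specification does not prescribe a traversal order.

```python
from collections import deque


def crawl(seed_uri, fetch_links):
    """Visit every document reachable from seed_uri once, breadth-first.

    `fetch_links(uri)` is a caller-supplied function returning the URIs
    referenced by the document at `uri`.
    """
    visited = []
    seen = {seed_uri}
    queue = deque([seed_uri])
    while queue:
        uri = queue.popleft()
        visited.append(uri)  # a real gatherer would process the document here
        for link in fetch_links(uri):
            if link not in seen:  # avoid revisiting documents in link cycles
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph: each URI lists the URIs its document refers to.
LINKS = {
    "uri1": ["uri2"],
    "uri2": ["uri3", "uri1"],
    "uri3": ["uri4"],
    "uri4": [],
}
order = crawl("uri1", lambda uri: LINKS.get(uri, []))
```

The `seen` set is what keeps the crawl from looping when documents refer back to one another, as the second document does here.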

[0039] The gatherer 300 processes the respective document by breaking the encountered document into one or more segments. In some cases, the gatherer 300 may first need to decode the document it is processing into a readable format. Once the gatherer 300 can read the format, and has divided the document into segments, the gatherer 300 attempts to identify portions of the segments based on the normalized alias information the gatherer 300 has referenced from the one or more directory services (e.g., 340, 350). For example, the segments may be some form of defined metadata concepts (e.g., “Author,” “Type,” “Workgroup,” etc.), or some type of non-defined text (e.g., underlined section headings).

[0040] For example, the gatherer 300 might identify URI₂ 315 as having an “Author” value of “NDS”, and a “Keyword” value of “pediatrics”. Alternatively, the gatherer 300 may find no such formal metadata concepts in URI₄ 325, but notice that the name “AJONES@ISP.NET” is centered at the top of the document, and so assign that document an “Author” value of “AJONES@ISP.NET.” After consulting the alias references, the gatherer 300 could then identify that “NDS”=“Nathan Smith” 234, and that “AJONES@ISP.NET”=“Aaron Jones” 232, along with other values, and then place those associations for the documents found at the respective URIs in a searchable index 360.

[0041] An exemplary searchable index 360 may have a defined set of classes 370 such as an “Author” class, a “Workgroup” class, and may further include a set of class aliases by which each class might be known in the organization, as well as a list of documents or entities that could belong to those classes. The searchable index 360 may also include a collection of entities 380 that relate to each of the classes. Entities may represent persons (e.g., a collection of aliases that identify the person) or class members, are used to establish relationships across classes, and are used for alias mapping. Class members are usually documents, or indexable entities having some property (i.e., metadata) that identifies them with a certain class. A class is then a “set” (i.e., a collection of objects that can be identified as belonging to the set) of these class members, and can be represented as a “set” in the mathematical sense. Thus, as in the example above, contact entry 232 for “Aaron Jones” may be represented as an entity in a searchable index, where the entity may include several name and group aliases, and may also include other types of values relating to one or more classes.
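
By way of illustration only, a searchable index holding classes, class aliases, and entities might be sketched in Python as follows. The `Entity` and `SearchableIndex` types, their field names, and the sample URIs are hypothetical; the sketch shows only that a class can be held as a set of member documents keyed by a normalized entity.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A person (or other class member) known under several aliases."""
    canonical: str
    aliases: set = field(default_factory=set)


@dataclass
class SearchableIndex:
    # class label -> {canonical entity name -> set of document URIs}
    classes: dict = field(default_factory=dict)
    # class alias (e.g., "By") -> preferred class label (e.g., "Author")
    class_aliases: dict = field(default_factory=dict)
    # canonical entity name -> Entity
    entities: dict = field(default_factory=dict)

    def add(self, class_label, entity_name, uri):
        """Record that the document at `uri` belongs to the class."""
        members = self.classes.setdefault(class_label, {})
        members.setdefault(entity_name, set()).add(uri)

    def members(self, class_label, entity_name):
        """Return the set of documents in the class for the entity."""
        return self.classes.get(class_label, {}).get(entity_name, set())


index = SearchableIndex(class_aliases={"By": "Author", "Written By": "Author"})
index.entities["Aaron Jones"] = Entity("Aaron Jones", {"AAJ", "AEJ"})
index.add("Author", "Aaron Jones", "uri4")
index.add("Author", "Aaron Jones", "uri1")
```

Because each class is stored as a Python `set`, membership tests and unions across classes follow directly from the mathematical sense of a set noted above.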

[0042] FIG. 4 illustrates a flow chart showing the functional steps that may be used to implement one example implementation of the gathering process, particularly in the context of recognizing classes. For purposes of illustration, the gatherer is assumed to have already received one or more values from a directory service such that the gatherer has a reference for one or more aliases and/or one or more classes. Continuing with FIG. 4, a document 400 is received by a gatherer (e.g., gatherer 300), and then filtered 405 into one or more document segments. Once the document has been filtered 405 into one or more segments, the gatherer processes the segment 407 by determining if the segment represents the end of the document 410. For example, the gatherer may look for a document-ending marker such as an End Of File (EOF) designator.

[0043] If the segment does not signal the end of the document 412, then the gatherer processes the segment further to determine if the segment matches one of the class types (e.g., “By AEJ”). If the segment appears to be a valid class type 425, then the gatherer will seek to identify 430 whether there is a known class alias for the apparent class type (e.g., “By=Author=From=Written By”). If there is no known class alias for the segment, so that “By” does not represent any specific meaning, the gatherer will move to the next segment 450 for processing. If, on the other hand, there is 445 an alias identifying the class type (e.g., “By” is recognized as an “Author” metadata concept), then the gatherer associates 460 the segment with the appropriate class by mapping the segment value to a preferred class label, and looks to see if any metadata associated with the class type are also aliases (i.e., “normalized” for a standard value, such as “AEJ=‘Aaron E. Jones’”) so that the class type property may be called for the normalized metadata (or alias for the normalized metadata) and vice versa. Having associated 460 the segment with any available classes and/or corresponding metadata, the gatherer adds the segment to a searchable index (e.g., 360) by association with the normalized value, and then proceeds to get the next segment 450.

[0044] If the next segment 450 resembles an end of document designator 410, the gatherer determines if there is a next document 470 to filter into segments. For example, the gatherer may detect that there is an embedded URI for a next document (e.g., “crawling”), or the gatherer may detect a next document as input received in queue from another input method (e.g., a file list). If the gatherer detects a next document 495, then the gatherer proceeds to filter the next document into segments, and continues the previously-described processing path. Alternatively, if the gatherer detects no new document, the gatherer ceases processing documents 480.
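
By way of illustration only, the processing path of FIG. 4 might be sketched in Python as follows. The “label=value” segment format, the function names, and the two alias tables are assumptions made for the sketch; reaching the end of a document is represented simply by running out of segments, and the end of processing by running out of queued documents.

```python
# Hypothetical alias tables of the kind a gatherer might have received.
CLASS_ALIASES = {"by": "Author", "author": "Author",
                 "from": "Author", "written by": "Author"}
VALUE_ALIASES = {"aej": "Aaron E. Jones", "nds": "Nathan Smith"}


def filter_segments(document_text):
    """Split a document into (label, value) segments, one per line."""
    for line in document_text.splitlines():
        label, sep, value = line.partition("=")
        if sep:  # lines without "=" are not treated as segments here
            yield label.strip(), value.strip()


def gather(documents):
    """Return {(class label, normalized value): set of document URIs}."""
    index = {}
    for uri, text in documents:                     # next document, if any
        for label, value in filter_segments(text):  # next segment, if any
            class_label = CLASS_ALIASES.get(label.casefold())
            if class_label is None:  # no known class alias: skip segment
                continue
            # Normalize the value when it is a known alias; else keep it.
            normalized = VALUE_ALIASES.get(value.casefold(), value)
            index.setdefault((class_label, normalized), set()).add(uri)
    return index


queue = [("uri1", "By=AEJ\nColor=blue"), ("uri2", "Author=NDS")]
index = gather(queue)
```

In this sketch the “Color” segment is dropped because no class alias is known for it, while “By” and “Author” both resolve to the preferred “Author” label.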

[0045] Example embodiments of the present invention also may be described in terms of methods comprising functional steps and/or non-functional acts. The following is a description of acts and steps that may be performed in practicing one exemplary embodiment. Usually, functional steps describe the invention in terms of results that are accomplished, whereas non-functional acts describe more specific actions for achieving a particular result. Although the functional steps and non-functional acts may be described or claimed in a particular order, the present invention is not necessarily limited to any particular ordering or combination of acts and/or steps.

[0046] FIG. 5 illustrates a flow chart of acts and functional steps for practicing one embodiment of the present invention, and will be described with reference to the foregoing figures. As shown, the inventive method may begin by performing the act of first receiving 500 a document 205 containing document data. As described, the document may contain specific references to metadata concepts and may have as values the underlying metadata corresponding to the metadata concepts. Having received the document 500 as input (or having crawled to the document via a URI), the inventive method may parse 510 the document into one or more document segments. This may be done by, for example, filtering the document 405 into document segments as described in FIG. 4. These filtered segments may be as simple as data blocks having text values 202, 206, and 208, or may be more complicated in the sense of containing text and textual relationship values to a corresponding class or metadata concept, such as data blocks 210, 214, and 216.

[0047] Thereafter, the inventive method performs the functional step 520 of normalizing document metadata used as a reference by a search engine. This step can include normalizing document metadata used as a reference by a search engine by maintaining one or more relationships between a search term and an alternate search term, a search term property or alternative search term property. Functional step 520 may be performed by the specific non-functional act of identifying 530 at least one of the one or more document segments as an alias; and by a non-functional act of associating 540 the received document with the document alias.

[0048] Act 530 may be performed by identifying at least one of the one or more document segments as an alias for a document datum found in an alias directory service. Thus, for example, the inventive method may process a directory service and then store a series of aliases that refer to a main text value, such as in the case of “Aaron E. Jones” (i.e., contact entry 232) being recognized by several monikers such as “Aaron Jones”, “AAJ”, “AEJ”, “AEJONES@ISP.NET”, etc. Similarly, the inventive method may additionally or alternatively store class type values so that, for example, all documents authored by a person can be found without regard to the various aliases the person has used over time. A document datum found in an alias directory might include, for example, the name “NDS” as an alias for the name “Nathan Smith” in contact entry 234, and an indication that “Nathan Smith” is further associated (via an associations database 240) as an “Author” (e.g., class 256) of a particular document (e.g., document 200).
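One possible in-memory form of such an alias store is sketched below. The `AliasDirectory` class and its method names are hypothetical, and the case-insensitive matching is an assumption of this sketch rather than a requirement of the described method.

```python
class AliasDirectory:
    """Maps every known alias to the main text value it refers to."""

    def __init__(self):
        self._canonical = {}  # case-folded alias -> main text value

    def add_contact(self, name, aliases):
        # The main value is registered as an alias of itself.
        for form in (name, *aliases):
            self._canonical[form.casefold()] = name

    def resolve(self, segment):
        """Return the main text value for a segment, or None if unknown."""
        return self._canonical.get(segment.strip().casefold())


directory = AliasDirectory()
directory.add_contact(
    "Aaron E. Jones", ["Aaron Jones", "AAJ", "AEJ", "AEJONES@ISP.NET"]
)
directory.add_contact("Nathan Smith", ["NDS"])
```

With this store, a document segment containing `aejones@isp.net` resolves to “Aaron E. Jones”, and a segment that matches no contact entry resolves to `None` and is therefore not treated as an alias.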

[0049] Act 540 may be performed by associating the received document with the document alias so that, upon request for the document datum through a search engine, the received document is returned to the requester by association of the document datum with the alias. Thus, for example, if a person enters the term “Nathan Smith” into a search engine input box and then executes the search, the search engine will look for any occurrence of the term “Nathan Smith”. The search engine will also look for all documents containing the associated aliases (e.g., terms “NDS” and “NAS”, from contact entry 242 of FIG. 2), since the term “Nathan Smith” and each of the received aliases have been normalized.
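The effect of this association on a search can be illustrated with a small index keyed by normalized terms. This is a sketch under stated assumptions: the `ALIASES` table, the file names, and the functions `normalize`, `build_index`, and `search` are hypothetical, and a production search engine would use its own index structures.

```python
from collections import defaultdict

# Assumed alias table: case-folded alias -> normalized term.
ALIASES = {
    "nathan smith": "nathan smith",
    "nds": "nathan smith",
    "nas": "nathan smith",
}


def normalize(term):
    """Map a term to its normalized form; unknown terms map to themselves."""
    key = term.strip().casefold()
    return ALIASES.get(key, key)


def build_index(documents):
    """documents: dict of document id -> list of terms found in it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[normalize(term)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of all documents associated with the query term."""
    return sorted(index.get(normalize(query), set()))


index = build_index({
    "memo.txt": ["Nathan Smith"],
    "notes.txt": ["NDS"],
    "report.txt": ["NAS", "Aaron Jones"],
})
```

Because both the indexed terms and the query pass through the same `normalize` step, a search for “Nathan Smith” and a search for “NDS” return the same three documents.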

[0050] In addition, the concept of normalizing the class type (e.g., “Author”) can be applied to train a statistical model used by the gatherer based on each set of documents (tagged with all alias forms) belonging to the respective class. That is, embodiments of the present invention may progressively modify their understanding of search terms to interrelate and associate concepts and metadata more correctly, or to “fine-tune” how the gatherer 300 relates documents and document data. For example, after crawling various documents on a network, the gatherer 300 may be trained to correctly discover other documents related to the given author, and other people related to the author, or to identify other concepts related to documents associated with the author, and so on. Accordingly, a statistical model can be trained for each class, so that there are as many statistical models as there are classes (e.g., in the case of authors, there would be a model for each author). Since a statistical model is usually represented as a set of key terms (i.e., keywords) with associated weights (i.e., importance), it is important to identify the set of documents that belong to the class correctly. Using aliases, as described herein, is one way of identifying the correct set of documents and, ultimately, of training the statistical model.
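As a rough illustration of a per-class model represented as key terms with weights, the sketch below computes relative term frequencies over the documents tagged with each class. Relative frequency is a stand-in chosen for brevity, not the weighting used by any particular embodiment, and `train_class_models` is a hypothetical name.

```python
from collections import Counter, defaultdict


def train_class_models(tagged_documents):
    """Build one keyword-weight model per class.

    tagged_documents: iterable of (class_label, text) pairs, where the
    label is the normalized value (e.g., the author's main name) under
    which the document was tagged.
    """
    counts = defaultdict(Counter)
    for label, text in tagged_documents:
        counts[label].update(text.lower().split())
    models = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        # Weight = share of the class's term occurrences (assumption).
        models[label] = {term: n / total for term, n in counter.items()}
    return models


models = train_class_models([
    ("Nathan Smith", "search index search"),
    ("Nathan Smith", "index"),
    ("Aaron E. Jones", "network"),
])
```

The sketch yields one model per author. If documents written under an alias such as “NDS” were not first tagged with the normalized label, their terms would be missing from the “Nathan Smith” model, which is why correct alias identification matters for training.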

[0051] Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where local and remote processing devices perform tasks and are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[0052] FIG. 6 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described in the general context of computer-executable instructions, such as program modules, being executed by computers in network environments. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.

[0053] With reference to FIG. 6, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a conventional computer 620, including a processing unit 621, a system memory 622, and a system bus 623 that couples various system components including the system memory 622 to the processing unit 621. The system bus 623 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 624 and random access memory (RAM) 625. A basic input/output system (BIOS) 626, containing the basic routines that help transfer information between elements within the computer 620, such as during start-up, may be stored in ROM 624.

[0054] The computer 620 may also include a magnetic hard disk drive 627 for reading from and writing to a magnetic hard disk 639, a magnetic disk drive 628 for reading from or writing to a removable magnetic disk 629, and an optical disc drive 630 for reading from or writing to a removable optical disc 631 such as a CD ROM or other optical media. The magnetic hard disk drive 627, magnetic disk drive 628, and optical disc drive 630 are connected to the system bus 623 by a hard disk drive interface 632, a magnetic disk drive interface 633, and an optical drive interface 634, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-executable instructions, data structures, program modules and other data for the computer 620. Although the exemplary environment described herein employs a magnetic hard disk 639, a removable magnetic disk 629 and a removable optical disc 631, other types of computer-readable media for storing data can be used, including magnetic cassettes, flash memory cards, digital versatile disks, Bernoulli cartridges, RAMs, ROMs, and the like.

[0055] Program code means comprising one or more program modules may be stored on the hard disk 639, magnetic disk 629, optical disc 631, ROM 624 or RAM 625, including an operating system 635, one or more application programs 636, other program modules 637, and program data 638. A user may enter commands and information into the computer 620 through keyboard 640, pointing device 642, or other input devices (not shown), such as a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 621 through a serial port interface 646 coupled to system bus 623. Alternatively, the input devices may be connected by other interfaces, such as a parallel port, a game port or a universal serial bus (USB). A monitor 647 or another display device is also connected to system bus 623 via an interface, such as video adapter 648. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

[0056] The computer 620 may operate in a networked environment using logical connections to one or more remote computers, such as remote computers 649a and 649b. Remote computers 649a and 649b may each be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically include many or all of the elements described above relative to the computer 620, although only memory storage devices 650a and 650b and their associated application programs 636a and 636b have been illustrated in FIG. 6. The logical connections depicted in FIG. 6 include a local area network (LAN) 651 and a wide area network (WAN) 652 that are presented here by way of example and not limitation. Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet.

[0057] When used in a LAN networking environment, the computer 620 is connected to the local network 651 through a network interface or adapter 653. When used in a WAN networking environment, the computer 620 may include a modem 654, a wireless link, or other means for establishing communications over the wide area network 652, such as the Internet. The modem 654, which may be internal or external, is connected to the system bus 623 via the serial port interface 646. In a networked environment, program modules depicted relative to the computer 620, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing communications over wide area network 652 may be used.

[0058] The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

We claim:
 1. In a computerized environment, a method of normalizing document data to improve the results of search requests, the method comprising the acts of: receiving a document containing document data; parsing the document data into one or more document segments; identifying at least one of the one or more document segments as an alias that correlates with a document datum found in an alias directory service; and associating the received document with the document alias so that, upon request for the document datum through a search engine, the received document is returned to the requester by association of the document datum with the alias.
 2. The method of claim 1, wherein the document data are metadata, and wherein the alias is a document metadatum.
 3. The method of claim 1, further comprising identifying a secondary document reference contained within the received document; parsing the secondary document into secondary document segments; identifying the secondary document segment with a secondary alias, and associating the secondary document segment with the secondary alias, the secondary document, and the received document.
 4. The method of claim 1, wherein the alias directory service is a contact database containing one or more aliases for one or more terms associated with one or more corresponding contacts.
 5. The method of claim 1, further comprising identifying the document segment as part of a predefined class or a class alias, so that a data request through a search engine returns the requested data to the requester when the requester enters one or more of the identified class, the class alias, and the alias.
 6. The method of claim 5, wherein the class is one or more of a weighted value for one or more associated terms, a metadata concept, and a property type.
 7. The method of claim 5, further comprising associating a term in an inverted index with one or more of the identified alias and the predefined class or class alias; and storing the inverted index for use by a search engine.
 8. The method of claim 6, wherein the property type is an authorship property.
 9. The method of claim 6, wherein a classification module further implements the method comprising: identifying a next document containing next document data that can be identified with the class, whereby the class comprises at least the document containing document data and the next document containing next document data; and based on the document data and the next document data, identifying additional documents within the class, so that the classification module is trained to associate additional documents with the class that would not have otherwise been identified.
 10. In a computerized environment, a method of normalizing document data to improve the results of search requests, the method comprising: an act of receiving a document containing document data; an act of parsing the document data into one or more document segments; and a step for normalizing document metadata used as a reference by a search engine by maintaining one or more relationships between a search term and an alternate search term, a search term property or alternative search term property.
 11. The method of claim 10, wherein the step for improving future search results returned to a requester of a requested term includes: an act of identifying at least one of the document segments as an alias for a document datum found in an alias directory service; and an act of associating the received document with the document alias so that, upon request for the document datum through a search engine, the received document is returned to the requester by association with the alias.
 12. The method of claim 10, further comprising receiving, by a gatherer module, directory service data that include one or more aliases for a metadatum.
 13. The method of claim 12, further comprising parsing the directory service data; and associating the parsed data with one or more classes so that one or more documents are related by one or more corresponding classes and one or more metadata aliases.
 14. The method of claim 13, wherein the directory service data are contained in one or more of a contact database and a text file having delimited values, wherein the delimited values equate one or more alternative terms for a normalized value.
 15. A computer program product having computer-executable instructions for performing a method of normalizing document data to improve the results of search requests, the method comprising the acts of: receiving a document containing document data; parsing the document data into one or more document segments; identifying at least one of the one or more document segments as an alias for a document datum found in an alias directory service; and associating the received document with the document alias so that, upon request for the document datum through a search engine, the received document is returned to the requester by association of the document datum with the alias.
 16. The computer program product of claim 15, wherein the document data are metadata, and wherein the alias is a document metadatum.
 17. The computer program product of claim 15, further comprising identifying a secondary document reference contained within the received document; parsing the secondary document into secondary document segments; identifying the secondary document segment with a secondary alias, and associating the secondary document segment with the secondary alias, the secondary document, and the received document.
 18. The computer program product of claim 15, wherein the alias directory service is a contact database containing one or more aliases for one or more terms associated with one or more corresponding contacts.
 19. The computer program product of claim 15, further comprising identifying the document segment as part of a predefined class or a class alias, so that a data request through a search engine returns the requested data to the requester when the requester enters one or more of the identified class, the class alias, and the alias.
 20. The computer program product of claim 19, wherein the class is one or more of a weighted value for one or more associated terms, a metadata concept, and a property type.
 21. The computer program product of claim 19, further comprising associating a term in an inverted index with one or more of the identified alias and the predefined class or class alias; and storing the inverted index for use by a search engine.
 22. The computer program product of claim 20, wherein the property type is an authorship property.
 23. The computer program product of claim 20, wherein a classification module further implements the method comprising: identifying a next document containing next document data that can be identified with the class, whereby the class comprises at least the document containing document data and the next document containing next document data; and based on the document data and the next document data, identifying additional documents within the class, so that the classification module is trained to associate additional documents with the class that would not have otherwise been identified.
 24. A computer program product having computer-executable instructions for performing a method of normalizing document data to improve the results of search requests, the method comprising: an act of receiving a document containing document data; an act of parsing the document data into one or more document segments; and a step for normalizing document metadata used as a reference by a search engine by maintaining one or more relationships between a search term and an alternate search term, a search term property or alternative search term property.
 25. The computer program product of claim 24, wherein the step for improving future search results returned to a requester of a requested term includes: an act of identifying at least one of the document segments as an alias for a document datum found in an alias directory service; and an act of associating the received document with the document alias so that, upon request for the document datum through a search engine, the received document is returned to the requester by association with the alias.
 26. The computer program product of claim 24, further comprising receiving, by a gatherer module, directory service data that include one or more aliases for a metadatum.
 27. The computer program product of claim 26, further comprising parsing the directory service data; and associating the parsed data with one or more classes so that one or more documents are related by one or more corresponding classes and one or more metadata aliases.
 28. The computer program product of claim 27, wherein the directory service data are contained in one or more of a contact database and a text file having delimited values, wherein the delimited values equate one or more alternative terms for a normalized value.